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Abstract 

Background: Despite increased investment in pharmaceutical research and development, fewer and fewer new 
drugs are entering the marketplace. This has prompted studies in repurposing existing drugs for use against 
diseases with unmet medical needs. A popular approach is to develop a classification model based on drugs with 
and without a desired therapeutic effect. For this approach to be statistically sound, it requires a large number of 
drugs in both classes. However, given few or no approved drugs for the diseases of highest medical urgency and 
interest, different strategies need to be investigated. 

Results: We developed a computational method termed "drug-protein interaction-based repurposing" (DPIR) that is 
potentially applicable to diseases with very few approved drugs. The method, based on genome-wide drug-protein 
interaction information and Bayesian statistics, first identifies drug-protein interactions associated with a desired 
therapeutic effect. Then, it uses key drug-protein interactions to score other drugs for their potential to have the 
same therapeutic effect. 

Conclusions: Detailed cross-validation studies using United States Food and Drug Administration-approved drugs 
for hypertension, human immunodeficiency virus, and malaria indicated that DPIR provides robust predictions. It 
achieves high levels of enrichment of drugs approved for a disease even with models developed based on a single 
drug known to treat the disease. Analysis of our model predictions also indicated that the method is potentially useful 
for understanding molecular mechanisms of drug action and for identifying protein targets that may potentiate the 
desired therapeutic effects of other drugs (combination therapies). 
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Background 

By conservative estimates, we know the molecular basis 
of more than 4,000 human diseases, whereas treatments 
are available for only about 250 of them [1]. The modern 
drug discovery paradigm, i.e., starting with a disease tar- 
get and looking for a highly selective small molecule that 
interacts strongly only with the intended target, is strug- 
gling to meet our medical and social requirements [2], 
Current estimates indicate that it takes an average of 
14 years at a cost of close to $2 billion to bring a new, 
safe, and efficacious drug to market [3]. In the process, 
more than 90% of the drug candidates fail due to safety 
concerns, inadequate bioavailability, or lack of efficacy [3]. 
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In reality, highly selective compounds are rare. The large 
number of safety and bioavailability issues facing candi- 
date drugs, as well as reported side effects of marketed 
drugs, is a reflection of many undocumented and deleteri- 
ous interactions between drugs and human targets. In 
addition, many underlying disease causes are multifactor- 
ial and can be due to dysfunctional processes involving 
multiple biomolecules. Even though significant efforts are 
devoted to understanding the molecular details of drug 
action, the fact is that the mechanisms of action of 
many efficacious drugs are poorly understood and, in 
many cases, remain largely unknown [4]. 

Drug interactions with unintended targets may lead to 
devastating side effects. However, these interactions may 
also signal the possibility for a drug to have therapeutic 
potential for diseases other than those for which it was 
approved. Indeed, there are many examples of a drug 
developed for a specific disease that was later approved 
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for treating an unrelated disease [5,6]. One of the most 
dramatic examples is thalidomide, a drug first marketed 
as an anti-nausea and sedative agent prescribed to treat 
morning sickness in pregnant women. Thalidomide was 
found to cause severe birth defects and was withdrawn 
from the market, but it later proved to have other 
therapeutic effects and was approved by the United 
States Food and Drug Administration (FDA) for skin le- 
sions caused by leprosy [7] and multiple myeloma [8]. 
In addition, thalidomide has shown promise in treating 
cutaneous lupus and Behcet's disease, human immuno- 
deficiency virus (HlV)-related mouth and throat ulcers, 
and blood and bone marrow cancers [9]. Although the 
recorded effects of thalidomide are multifaceted, with mul- 
tiple underlying mechanisms possible [10], clarification of 
a mechanism [11] that distinguishes between the terato- 
genic and anticancer therapeutic effects of thalidomide 
[12,13] was only recendy identified. The success of drug re- 
purposing, i.e., finding new uses for existing drugs, via ser- 
endipitous discoveries inspired the development of many 
computational approaches for the discovery of the yet un- 
known therapeutic potentials of existing drugs [14-24]. 

A common approach used in drug repurposing is to 
build a binary classifier based on a training set consisting 
of drugs with and without a desired therapeutic effect as 
the positive and negative classes, respectively. One of the 
requirements for developing a statistically sound classifier 
is the availability of a relatively large number of drugs in 
the training set. Furthermore, it is desirable to have an 
equal number of drugs with and without the desired thera- 
peutic effect in the training set so as not to bias classifier 
training. However, when there are many drugs approved 
for treating a disease, the need for discovering more drugs 
for the same disease is less than that for a disease for which 
there is a very limited number of drugs or no drugs at all. 
Thus, there is a need to develop computational drug repur- 
posing methods that can be applied to diseases for which 
there are very few known pharmacological options. 

In this article, we report on the development of a com- 
putational approach termed "drug-protein interaction- 
based repurposing" (DPIR) that is potentially applicable 
to diseases for which a very limited number of drugs are 
available. We based DPIR on large-scale drug-protein 
interaction profiles and Bayesian statistics to decipher 
the drug-protein interactions indicative of a desired thera- 
peutic effect. These drug-protein interactions are then 
used to identify other drugs that are likely to have the 
same therapeutic effect. 

Methods 

Source of large-scale chemical-protein interaction 
information 

To create large-scale chemical-protein interaction pro- 
files for FDA-approved drugs and drug development 



candidates, we exploited the Search Tool for Interactions 
of Chemicals (STITCH) database [25]. The October 
2013 release of the database (STITCH 3.1) contains 
chemical-protein interaction information, derived from a 
broad range of sources, between 300,000 small mole- 
cules and 2.6 million proteins from 1,133 organisms. 
The database provides a confidence measure for each 
chemical-protein interaction calculated by the equation 
score = 1 - n,(l - Pi), with corrections that take into ac- 
count the possibility of observing an interaction by chance 
[25]. In the equation, p t denotes the confidence of inter- 
action from the i'-th information source. Based on STITCH, 
a score between 0.40 and 0.70 indicates medium confi- 
dence, between 0.70 and 0.90 indicates high confidence, 
and between 0.90 and 1.00 indicates the highest confidence. 

To retain high-confidence chemical-protein interactions, 
we filtered out entries in STITCH 3.1 with confidence 
scores of <0.70. In addition, we removed all entries of 
chemical interaction with non-human proteins. The filter- 
ing reduced the total number of small molecule-protein 
interaction entries from >171 million to just over a half 
million. The categories of chemical-protein interactions 
with the highest occurrence in the database are binding 
(chemical binds to protein), inhibition (chemical inhibits 
protein function), and activation (chemical enhances pro- 
tein function). Because the therapeutic effects of most 
drugs are due to chemical modulation of protein function, 
functional information of chemical-protein interactions, 
i.e., inhibition or activation, is important. However, this 
information is not always available. Instead, the most 
prevalent type of interaction information is binding. To 
create drug-protein interaction profiles relevant for drug 
repurposing, we retained interactions of only these three 
categories. This left 445,162 interactions between chemi- 
cals identified by 232,765 unique STITCH chemical iden- 
tifiers and 6,399 unique human proteins. 

Source of FDA-approved drugs and drug development 
candidates 

To generate a list of FDA-approved drugs and drug devel- 
opment candidates, we retrieved the SMILES strings of all 
structurally unique small molecule compounds in Drug- 
Bank [26]. Molecular structures represented by the SMILES 
strings were standardized, i.e., we stripped salts, standard- 
ized charge representation, removed stereochemistry label- 
ing, removed single atom fragments, neutralized bonded 
zwitterions, and protonated acids/deprotonated bases. After 
structure standardization, we generated canonical SMILES 
and removed duplicates, resulting in 4,902 unique entries. 
They consisted of 1,163 FDA-approved drugs, 3,630 drug 
development candidates, 55 nutraceuticals, and 54 drugs 
withdrawn from market. These molecules are all referred to 
as "drugs" in the remainder of this article. 
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Computational prediction of drug-protein interactions 

Most of the compounds in DrugBank are in the biological 
activity screening libraries of pharmaceutical companies, 
government research laboratories, and academic institu- 
tions. However, not all of the DrugBank compounds have 
been tested in all assays evaluating chemical-protein inter- 
actions and, hence, the data collected in the STITCH 
database do not cover all drug-protein interactions. Thus, 
to create as complete drug-protein interaction profiles 
as possible, we complemented the drug-protein interac- 
tions contained in the STITCH database with predicted 
drug-protein interactions based on chemical structural 
similarity. This was accomplished by re-implementing 
the similarity ensemble approach (SEA) [27] and pre- 
dicting additional drug-protein interactions based on 
the collection of chemical-protein interactions con- 
tained in STITCH 3.1. SEA predictions are based on 
two-dimensional molecular structure similarity as mea- 
sured by Tanimoto coefficients between a drug mol- 
ecule and all known ligands of a protein. When the 
similarity score is high, the probability that the drug in- 
teracts with the same protein is high. In this study, we 
retained drug-protein interaction predictions with a p-vahie 
cutoff of 0.01, and combined these predictions with the 
high-confidence drug-protein interactions contained in the 
STITCH database. The so-constructed final set of drug- 
protein interactions is available for non-commercial use 
(via download at http://www.bhsai.org/downloads/drugre- 
purposing/). 

Creation of a machine-readable representation of 
drug-protein interaction profiles 

For the application of machine learning techniques, we 
created a binary bit-string representation of the drug- 
protein interaction profile for each drug. In the bit- 
string representation, each protein was assigned up to 
three bit positions to encode 1) drug interaction (bind- 
ing) with the protein, 2) drug activation of the protein, 
and 3) drug inhibition of the protein. If a drug was re- 
corded in STITCH and/or predicted to bind to a protein, 
the bit representing this drug-protein interaction was 
turned on, i.e., assigned a value of 1 (on-bit). Otherwise, 
the bit was turned off (assigned a value of 0). The bits 
representing drug activation and drug inhibition of a 
protein were similarly set. To reduce memory usage, 
when a bit was off for all drugs, the bit was removed. 
The length of the final binary string was 8,769 bits, 
representing drug interactions with 5,516 unique human 
proteins. If the SEA-predicted drug-protein interactions 
were excluded, the length of the final binary string 
would be 7,886, representing interactions between 4,369 
drugs and 5,003 unique human proteins. Thus, in our 
final drug-protein interaction profile dataset, the number 
of drugs whose high-confidence protein interaction 



information was not found in the STITCH 3.1 database, 
but predicted by SEA, was 533. In other words, SEA 
predictions contributed about 10% of the drug-protein 
interaction pairs in our final drug-protein interaction 
profiles. 

Method for identifying drug-protein interactions 
contributing to a desired therapeutic effect 

We found that the Laplacian-corrected Bayesian method 
of Xia et al. [28] was suitable for identifying drug- 
protein interactions indicative of a desired therapeutic 
effect. To illustrate this method, we used a collection of 
drug-protein interaction profiles, as shown in Figure 1. 
Given that M of the N drugs are approved for a certain 
disease, we assigned them to the positive class, i.e., those 
drugs that are known to have a desirable therapeutic ef- 
fect. The number of on-bits of bit feature among the 
positive class is denoted by A, and the number of on- 
bits of the same bit feature among all drugs is denoted 
by Bi. For drug repurposing, we need to estimate the 
conditional probability p(+\fi) of a drug-protein inter- 
action represented by bit feature/ to be responsible for 
the desired therapeutic effect of the drugs in the positive 
class. Bayes theorem gives the following: 



P(+\ft) 



P(+)xp(fi\+) f x£_ A , 



Pifi) 



(1) 



where P(+) is the prior probability for a drug to be posi- 
tive, p(fi\+) is the conditional probability of f t to be an 
on-bit in the positive class, and pifi) is the probability of 
ft to be an on-bit in all drugs. In most cases, Eq. 1 pro- 
vides a reasonable estimate of p(+\fi). However, when a 
bit feature is severely under-sampled, such as in a situ- 
ation where 5, = A t = 1, then p(+\fi) = 1.0, which provides 
an overly optimistic probability estimate. However, when 
more samples with this bit feature on are included in the 
data set, p{+ [//) will most likely decrease. Thus, for dis- 
eases with few drugs, we expect the number of drugs in 
the positive class to be very small, and therefore under- 
sampling of bit features may be common, leading to an 
inaccurate probability estimate of the predictive value of 
the bit feature. To correct for the effect of undersam- 
pling, one could assume that if this bit feature were sam- 
pled K more times, a reasonable estimate of the number 
of positive drugs would be KxP(+). Thus, the corrected 
probability is as follows: 



Pc(+\fi) 



At + [K x P{+)} 



Bi + K 



(2) 



This correction ensures that as B, — > 0 and A t — » 0, 
P (+1/D approaches the prior probability P(+). When K is 
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D2 100 001 001 000 000 000 010 110 001 001 

D3 000 110 000 000 100 000 Oil 000 010 000 

D4 010 100 000 001 000 000 100 100 000 000 

D5 Oil 000 010 100 001 000 001 001 100 000 

D6 000 100 110 010 100 101 001 100 100 100 

D7 100 100 000 000 001 000 100 100 000 010 
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Protein 1 Protein 2 • • • • 
Figure 1 Schematic bit-string representation of a drug-human protein interaction profile. Each protein is represented by 3 bits to encode 
drug binding, drug activation, and drug inhibition of the protein, respectively. When a drug has been reported or predicted to bind, activate, or 
inhibit a protein, the bit representing the specific drug-protein interaction is turned on (assigned a value of 1). Otherwise, the bit is off (assigned 
a value of 0). M denotes the number of drugs with an approved indication (positive class), N denotes the total number of drugs, f, represents 
on-bit or off-bit of the /-th bit feature, A, denotes the number of on-bits of the /-th bit feature in the positive class, and 6,- denotes the number of 
on-bits of the /-th bit feature in all drugs. 



set to 1/P(+), the adjustment corresponds to the Laplacian 
correction [29]. 

Following Xia et al. [28], we defined a weight w t to rep- 
resent the contribution of the 2-th bit- feature^ to the de- 
sired therapeutic effect of the positive class, as follows: 

Once w>i for every has been determined from a training 
set consisting of positive and negative classes of drugs, 
one can rank a given drug by its likelihood of having the 
desired therapeutic effect of the positive drug set based on 
the following score: 

score =Y^ i=1 WiXfi, (4) 

where equals 1 or 0, representing on and off bits, 
respectively. The higher the score a drug receives, the 
more likely it has the therapeutic effect of the positive 
class of the training set. In this formulation, the numeric 
score itself is never used to calculate an absolute prob- 
ability. Rather, it is only used to rank the likelihood of a 
drug in a given disease model as a possible repurposing 
candidate for that disease. 

Results and discussion 

Details of model development and quality assessment 

To assess performance of the drug repurposing method 
described above, we used three model development 
procedures, as shown in Figure 2. Type I model devel- 
opment represents a conventional machine learning 
process in which a data set is segregated into a training 
set and a testing set. The training set consists of a sub- 
set of samples of the positive class and a subset of 



samples of the negative class. The remaining samples, 
including both positive and negative samples, are 
grouped into the testing set. The model parameters are 
determined by the training set only. The model is then 
applied to the testing set to assess its ability to distinguish 
the positive from the negative samples. In principle, 
type I models are not suitable for drug repurposing ap- 
plications because most drugs were developed for 
treating a specific disease. Accordingly, for most drugs, 
their ability to treat other diseases has not been sys- 
tematically evaluated and, in most cases, one cannot 
confidently label true negative drugs (samples) in the 
training set. 

A more robust model development approach is repre- 
sented by a type II model, which is trained with a subset 
of the positive drugs as the positive class and all other 
drugs collected in a baseline class, i.e., a large set of 
compounds that may or may not include drugs with a 
desired therapeutic effect. Because all drugs are used for 
model development, there is no testing set. However, for 
drug repurposing, one can simply score all the drugs 
assigned to the baseline class with the model and evalu- 
ate the degree of enrichment of the (known) positive 
drugs in the highest-scored samples. Type II models are 
more appropriate than type I models for drug repur- 
posing, based on the premise that there exist drugs 
with yet unknown desirable therapeutic effects for a 
disease among the marketed drugs. 

Type III models are constructed for the purpose of 
examining the impact of false positives on model devel- 
opment. In this model-building process, k baseline drugs 
(those not known to have the same therapeutic effect of 
the drugs of the positive class) are purposely introduced 
in the positive class as false positives. 
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Figure 2 Three machine learning approaches for developing data-driven drug repurposing models. The total number of drugs and drug 
development candidates with drug-protein interaction profiles is 4,902. m denotes the number of drugs with a desirable therapeutic effect (positive 
class), n represents a subset of m used as the positive class of the training set for model development, and k denotes the number of drugs that do not 
have a desired therapeutic effect but can be used as false positives (FP) for the purpose of model development. TP: true positive. 



DPIR prediction analysis 

To assess the performance of the proposed drug repur- 
posing method, we performed detailed cross-validation 
analysis using FDA-approved drugs for hypertension, 
HIV, and malaria. These diseases were selected based on 
two criteria: the disease is clinically well defined and a 
significant number of FDA-approved drugs are available 
for cross-validation analyses. Note that our goal was to 
develop a method that can be applied to a disease with 
as few as one approved drug. However, for evaluating 
the performance of the DPIR method and establishing 
the advantages and disadvantages of different model 
training processes, we needed to use diseases with a sig- 
nificant number of approved drugs. 

For hypertension, 55 single-component (non-combin- 
ation) drugs on the National Institutes of Health (NIH) 
high blood pressure (HBP) drug list [30] have drug-protein 
interaction information. For HIV, 20 single-component 
drugs on FDA's HIV drug list [31] have drug-protein inter- 
action information. For malaria, 7 single-component FDA- 
approved drugs were listed on the Centers for Disease 
Control and Prevention (CDC) malaria treatment Web site 
[32]. In addition, we searched DrugBank and identified 4 
additional drugs that were approved for treating malaria 
but are not in the current CDC's list because of drug resist- 
ance, severe side effects, or because they were replaced by 
newer generations of antimalarial drugs of the same class. 



We included these 4 drugs to define a set of 11 antimalarial 
drugs for analysis. The names of these drugs, their molecu- 
lar structures in the form of SMILES strings, and the num- 
ber of on-bits in the drug-protein interaction profiles are 
available via publicly available download at http://www. 
bhsai.org/downloads/drugrepurposing. Even though the 
bit-length of the drug-protein interaction profiles is 8,769, 
for a specific drug the number of on-bits is much lower. It 
depends on how many proteins a drug interacts with and 
how many of these interactions have been identified. Thus, 
newer drugs tend to have a lower number of on-bits, as 
the synthetic routes may be under patent coverage and, 
therefore, the compounds are less likely to be found in the 
catalogs of chemical suppliers. Compounds extracted from 
natural products (such as some antimalarial drugs) also 
tend to have lower numbers of on-bits, simply because 
these compounds are more expensive to acquire than their 
synthetic counterparts and, therefore, are found in fewer 
screening libraries. Among the 20 HIV drugs, the number 
of on-bits ranges from 1 (etravirine, approved by the FDA 
in 2008) to 42 (ritonavir, approved by the FDA in 1996), 
with the average number of on-bits being 12. Among 
the 11 antimalarial drugs, the number of on-bits ranges 
from 1 (lumefantrine) to 57 (quinine) with an average 
of 20. Among the 55 hypertension drugs, the number 
of on-bits ranges from 4 (fosinopril) to 111 (verapamil) 
with an average of 36. 
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Using the drugs for each disease, we performed complete 
cross validation by training type I, II, and III models using 
one, two, or three approved drugs in the positive classes. 
To perform complete cross validation of models developed 
using a single positive drug in the positive class, each and 
every positive drug was used once as the positive class to 
build a model. For complete cross validation of models de- 
veloped with two (three) positive drugs in the positive 
class, each unique pair (triplet) of the positive drugs was 
used once as the positive class to develop a model. With 
an increasing number of positive drugs and increasing 
number of positive drugs assigned to the positive class 
for model development, the total number of models for 
complete cross validation increases dramatically. For 
instance, with 55 HBP drugs for hypertension, complete 
cross validation for models developed with 3 positive 
drugs required training of 26,235 models. Despite the 
large number of models to train, the DPIR method is 
computationally efficient as there is no regression or 
optimization of any parameters required. We assessed 
the performance of the models by degree of enrichment 
of positive drugs in the top-ranked testing sets or base- 
line classes. 



Performance of type I models 

Figure 3A-C shows degrees of enrichment of positive 
drugs by type I models for hypertension, HIV, and anti- 
malarial drugs. These results show that with models de- 
veloped using a single positive drug in the positive class 
of the training set, on average, between 15% and 20% of 
positive drugs in the testing sets were in the highest- 
scored 1% of the tested samples, between 30% and 40% 
of positive drugs were in the highest-scored 5% of the 
tested samples, and between 40% and 60% of positive 
drugs were in the highest-scored 10% of the tested sam- 
ples. The corresponding enrichments by models devel- 
oped from two or three positive drugs in the positive 
class of the training sets were slightly higher, but, more 
importantly, they exhibited lower variability, as indicated 
by decreasing standard deviations of the fraction of posi- 
tive drugs in the highest-scored testing samples. This is 
because increasing the sample size (number of positive 
drugs in the positive class) increases the signal-to-noise 
ratio, which improves Bayesian statistics. 

The expected fractions of positive drugs randomly dis- 
tributed in the testing sets were 1%, 5%, and 10% in the 
1%, 5%, and 10% of testing samples, respectively. Com- 
pared with a random selection, type I models achieved 
significant enrichment, especially considering the small 
number of positive drugs (one, two, or three) present in 
the positive class for model development. The corre- 
sponding p-vahies for achieving such levels of enrich- 
ment by chance rang ed from 6.5 x 10" 3 to 1.8 x 10" 28 . 



Performance of type II models 

Figure 3D-F shows the results of type II models for the three 
diseases. Even though a large number of positive drugs for 
each disease are in the baseline class and play the role of 
false negatives, they had little impact on model performance. 
The bar heights of type I and type II models for each disease 
and the standard deviations were nearly identical. The lack 
of appreciable impact of the false negatives was at first sur- 
prising. However, as the number of false negatives was very 
small compared with the total number of compounds in the 
various baseline classes, their effect was minimal. As a re- 
sult, the false negatives had negligible impact on the weight 
Wi associated with each drug-protein interaction. Thus, the 
results shown in Figure 3 indicate that false negatives have 
a negligible impact on the model development. 

Impact of false positives on model performance 

Figure 4 shows the comparison between type II and type 
III models. To develop the type III models, we included 
one false positive drug in the positive class for model 
building. The comparison shows that a false positive in 
the training set had a significant negative impact on model 
quality. Models developed with one true positive and one 
false positive in the positive class performed no better 
than picking the drugs randomly. Increasing the number 
of true positives in the positive class improved model per- 
formance; the improvement was not very significant as 
the enrichment rates of type III models were significantly 
lower than the corresponding enrichment rates of type II 
models. The practical implication is that the drugs used 
for the positive class in the model development should 
truly be associated with the desired therapeutic effect in 
order for the repurposing method to work efficiently. 

Model performance without SEA-predicted drug-protein 
interactions 

To examine the impact of SEA-predicted drug-protein in- 
teractions on model performance, we re-created drug- 
protein interaction profiles using information only con- 
tained in the STITCH database and used the profiles to re- 
peat the cross-validation calculations of type II models. 
Table 1 shows the results in comparison with those ob- 
tained using drug-protein interaction information from the 
STITCH database augmented by SEA predictions. For the 
hypertension and HIV models, the enrichment rates ob- 
tained with and without SEA-predicted drug-protein inter- 
actions were very close, with a slight increase in retrieved 
drugs when adding the SEA predictions. For malaria, how- 
ever, the models that did not use SEA-predicted drug- 
protein interactions performed significantly better. The 
primary reason for this was that 6 out of the 11 malaria 
drugs shared the same mechanism of action. They con- 
sisted of quinine, a natural product extracted from the 
bark of cinchona trees, and 5 synthetic analogs of 
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Figure 3 Performance comparison between type I and type II models. Comparison of enrichment efficiencies of type I (A-C) and type II 
(D-F) models for high blood pressure (HBP), HIV, and antimalarial drugs. The models were built with one, two, and three drugs in the positive 
class of the training set. Bar heights denote the fraction of FDA-approved HBP, HIV, and antimalarial drugs in the testing set (type I models) or 
baseline class (type II models) that scored in the highest 1%, 5%, and 10% of the compounds, respectively. Error bars represent 1 standard deviation 
from full cross-validation calculations. 



quinine: amodiaquine, mefloquine, chloroquine, hydro- 
xychloroquine, and primaquine. Because they shared 
the same mechanism of action, this mechanism of ac- 
tion was already well represented by the drug-protein 
interactions in the STITCH database. Thus, for this 
model, the additional SEA-predicted interactions intro- 
duced mechanistic noise through irrelevant interactions 
that led to a degraded model performance. 

In summary, the main contribution of the SEA predic- 
tions was to make -10% of the drugs whose protein inter- 
action information was not available in STITCH 3.1 
amenable to repurposing. As an example, among the 20 



HIV drugs, high-confidence drug-protein interaction in- 
formation for fosamprenavir, a pro-drug, was not present 
in STITCH 3.1. Therefore, this compound would not be 
amenable to repurposing by DPIR without including the 
SEA-predicted interactions. The additional 11 protein in- 
teractions with fosamprenavir, derived from the SEA pre- 
dictions, allowed us to include it in our modeling analyses. 

Impact of increasing number of drugs in the positive 
class on model performance 

In the above analyses, all models were built with one to 
three positive drugs to simulate the situation of a disease 
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Figure 4 Impact of false positive on the performance of type II models. The models were built with one to three true positives and either 
zero or one false positive. Error bars represent 1 standard deviation from full cross-validation calculations. The random bar heights represent the 
expected fractions of positive drugs in 1%, 5%, and 10% randomly picked baseline compounds. Models constructed with no false positives 
correspond to type II models (Figure 3,D-F). A: High blood pressure (HBP) model. B: Human immunodeficiency virus (HIV) model. C: antimalarial 
model. TP: true positive. FP: false positive. 



with very few approved drugs. To evaluate the impact of 
increasing the number of positive drugs on model per- 
formance, we also performed cross-validation calculations 
using 1, 5, 11, and 54 hypertension drugs in the positive 
class for type II model development, and we evaluated the 
enrichment power of the resulting models for the hyper- 
tension drugs left in the baseline class. On average, among 
the highest-scored 1% baseline compounds, there were 
13%, 25%, 33%, and 56% hypertension drugs with models 
developed with 1, 5, 11, and 54 hypertension drugs. The 
corresponding hypertension drugs in the highest-scored 



5% baseline compounds were 36%, 61%, 71%, and 85%, re- 
spectively, and the corresponding hypertension drugs in 
the highest-scored 10% baseline compounds were 50%, 
77%, 85%, and 100%. We obtained similar results from the 
other models (data not shown). Thus, as expected, in- 
creasing the number of compounds in the training set in- 
creased the number of positive drugs retrieved. 

Pharmacological information of top-scored baseline drugs 

The analyses above indicate that models derived by in- 
creasing the size of the positive class have increasingly 
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Table 1 Impact of SEA-predicted drug-protein 
interactions on Type II model performance 3 







With SEA 


Without SEA 


Evaluation of 
top-ranking scores (%) 


Drugs in Fraction 
training set 


a 


Fraction 


o 


High Blood Pressure model 


1 


1 


0.14 


0.06 


0.15 


0.07 


5 


1 


0.36 


0.11 


0.36 


0.11 


10 


1 


0.50 


0.14 


0.48 


0.14 


1 


2 


0.14 


0.06 


0.18 


0.07 


5 


2 


0.48 


0.09 


0.45 


0.10 


10 


2 


0.64 


0.09 


0.60 


0.11 


1 


3 


0.19 


0.06 


0.20 


0.07 


5 


3 


0.53 


0.08 


0.52 


0.10 


10 


3 


0.70 


0.08 


0.67 


0.10 


HIV model 


1 


1 


0.16 


0.09 


0.20 


0.19 


5 


1 


0.28 


0.14 


0.30 


0.18 


10 


1 


0.37 


0.17 


0.39 


0.21 


1 


2 


0.17 


0.07 


0.18 


0.07 


5 


2 


0.32 


0.12 


0.33 


0.10 


10 


2 


0.44 


0.13 


0.43 


0.15 


1 


3 


0.19 


0.07 


0.21 


0.07 


5 


3 


0.34 


0.11 


0.37 


0.09 


10 


3 


0.48 


0.10 


0.50 


0.12 


Malaria model 


1 


1 


0.14 


0.07 


0.20 


0.24 


5 


1 


0.34 


0.14 


0.54 


0.25 


10 


1 


0.70 


0.22 


0.79 


0.22 


1 


2 


0.20 


0.13 


0.20 


0.13 


5 


2 


0.43 


0.20 


0.58 


0.16 


10 


2 


0.73 


0.15 


0.89 


0.10 


1 


3 


0.21 


0.09 


0.21 


0.10 


5 


3 


0.49 


0.15 


0.66 


0.13 


10 


3 


0.77 


0.13 


0.91 


0.07 



a The type II models were built with one, two, and three positive drugs in the 
positive class of the training set. The fraction of positive drugs in the baseline 
class that scored in the highest 1%, 5%, and 10% of the compounds are recorded 
for models using the STITCH 3.1 database with and without SEA-predicted 
drug-protein interactions. Fraction, fraction of known drugs retrieved; 
o, standard deviation; SEA, similarity ensemble approach. 

better performance. In practical applications, one should 
include all drugs approved for an indication/disease in 
the positive class of the training process. To this end, we 
developed our final models for hypertension, HIV, and 
malaria by including all the drugs for each disease in the 
respective positive class. We used the resulting type II 
models to score the baseline compounds and we 
assessed the predictive power of the models by perform- 
ing literature searches of the relevant information for 



the 10 highest-scored FDA-approved drugs for each dis- 
ease. Tables 2, 3 and 4 summarize the results. 

Because we included all the drugs from the NIH's 
hypertension drug list in the positive class for model de- 
velopment, the baseline class did not include any of 
these drugs. As shown in Table 2, among the 10 highest- 
scored FDA-approved drugs in the baseline class, six of 
them have hypertension as part of their approved indica- 
tions based on information contained in DrugBank [26]. 
In addition, a 1998 clinical study observed that nimodi- 
pine, the second highest-scored baseline drug, reduced 
systolic and diastolic blood pressures of the central ret- 
inal artery, and it therefore may be a candidate for 
pregnancy- induced hypertension therapy [33]. However, 
information from DrugBank indicates that norepineph- 
rine, the sixth highest-scored drug in Table 2, has been 
approved for the treatment of critical hypotension, and 
yohimbine was approved for the treatment of impotence. 
In addition, epinephrine was approved for asthma and 
cardiac failure, and a 2002 review found that use of this 
drug was associated with small non-significant increases 
in systolic and diastolic blood pressure [34]. It seems 
that these three drugs induce a blood pressure increase, 
contrary to the expected therapeutic effect of hyperten- 
sion drugs. This apparent contradiction is rationalized 
because functional information for many drug-protein 
interactions contained in the STITCH database is in- 
complete. Although these drugs and the hypertension 
drugs interact with the same proteins, they induce an 
opposite functional outcome, i.e., activation instead of 
inhibition of the targeted proteins. Because of the lack of 
specific functional information, drugs interacting with 
the same proteins annotated as "binding" in the STITCH 
database were scored high by the algorithm, even though 
they may have opposite therapeutic effects. We similarly 
observed the same effect in the highest-scored baseline 
drugs identified by the HIV and malarial models dis- 
cussed below. 

Table 3 shows the results for the 10 baseline drugs 
scored highest by the HIV type II model. Among them, 
amprenavir was approved by the FDA as a treatment for 
HIV-1 infection in combination therapy with other anti- 
HIV drugs, even though it is not approved as a mono- 
therapy for HIV. Surprisingly, two statins, atorvastatin 
and lovastatin, scored among the top- 10 drugs, suggest- 
ing that they affect proteins that are also targeted by 
anti-HIV drugs. Indeed, our literature search found a 
2004 study that concluded that statins inhibit HIV-1 in- 
fection by downregulating Rho activity [35]. The anti- 
inflammation drug dexamethasone also scored high. Our 
search identified a 2001 study that found that dexa- 
methasone inhibits CD4 T cell death mediated by mac- 
rophages from HIV-infected persons [36]. Toward the 
end of an HIV infection, the number of functional CD4 
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Table 2 Therapeutic information of the drugs assigned to the baseline class that were scored highest by the 
hypertension model 3 

Generic name Score Information from DrugBank b 

Nitrendipine 103.9 For mild to moderate hypertension. 



Information from other sources 



Nimodipine 



93.9 For use as an adjunct to improve neurologic outcome following A 1998 clinical study [33] observed that nimodipine 



reduces systolic and diastolic blood pressures of the 
central retinal artery and therefore may provide another 
therapeutic choice for pregnancy-induced hypertension. 



Alprenolol 

Nilvadipine 

Oxprenolol 



Norepinephrine 53.8 



subarachnoid hemorrhage. 

69.7 For hypertension, angina, and arrhythmia. 

69.2 For vasospastic angina, chronic stable angina, and hypertension. 

69.1 For the treatment of hypertension, angina pectoris, arrhythmias, 
and anxiety. 

For patients in vasodilatory shock states, also used as a 
vasopressor medication for patients with critical hypotension. 

Spirapril 52.1 An ACE inhibitor for hypertension. 

Lercanidipine 46.5 For hypertension, angina pectoris, and Raynaud's syndrome. 
Yohimbine 44.7 Used as a mydriatic and for the treatment of impotence. 

Epinephrine 43.9 Used in asthma and cardiac failure and to delay absorption of A systematic review in 2002 [34] found that the use of 
local anesthetics. epinephrine in uncontrolled hypertensive patients was 

associated with small, non-significant increases in systolic 
and diastolic blood pressure. 

a The model was developed with all 55 drugs in the National Institutes of Health hypertension drug list in the positive class; the remaining drugs and drug 
development candidates were potential repurposing candidates in the baseline class. 
b Ref. [26]. 

ACE: angiotensin-converting enzyme. 

T cells falls, which leads to the symptomatic stage of report for its putative anti-HIV activity. Instead, a 1991 
AIDS. The observation that dexamethasone inhibits study observed that, in high concentration, verapamil 
CD4 T cell death supports the predictive power of our can potentiate HIV-1 replication in lymphoid cells [37]. 
model. However, our search for the highest-scored base- Its interaction with nuclear factor K-light-chain enhancer of 
line drug, verapamil, did not find a convincing literature activated B cells (NF-kB) was suggested to be responsible 



Table 3 Therapeutic information of the drugs assigned to the baseline class that were scored highest by the HIV model 3 



Generic name 


Score 


Information from DrugBank b 


Information from other sources 


Verapamil 


23.5 


For hypertension, angina, and cluster headache prophylaxis. 


A 1991 study [37] observed that verapamil at 
high concentration can potentiate HIV-1 
replication in lymphoid cells. 


Clotrimazole 


22.6 


For oropharyngeal candidiasis, vaginal yeast infections, 
and fungal infections. 




Ketoconazole 


21.5 


For fungal infections. 




Dexamethasone 


21.0 


An anti-inflammatory 9-fluoro-glucocorticoid. 


A 2001 study [36] found that dexamethasone 
inhibits CD4 T cell death mediated by macrophages 
from HIV-infected persons. 


Amprenavir 


19.6 


For treatment of HIV-1 infection in combination with 
other antiretroviral agents. 




Atorvastatin 


19.3 


For hypercholesterolemia. 


A 2004 study [35] concluded that statins inhibit HIV-1 
infection by downregulating Rho activity. 


Clarithromycin 


18.5 


Antibiotic. 




Lovastatin 


18.2 


For hypercholesterolemia. 


A 2004 study [35] concluded that statins inhibit HIV-1 
infection by downregulating Rho activity. 


Quinidine barbiturate 


18.1 


For the treatment of ventricular pre-excitation 
and cardiac dysrhythmias. 




Cimetidine 


17.9 


For acid-reflux disorders (GERD), peptic ulcer disease, 
heartburn, and acid indigestion. 





a The model was developed with all 20 HIV drugs in the FDA's HIV drug list in the positive class; the remaining compounds were potential repurposing candidates 
in the baseline class. 
b Ref. [26]. 
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Table 4 Therapeutic information of the drugs assigned to the baseline class that were scored highest by the malaria 
model 3 



Generic name 


Score 


Information from DrugBank b 


Information from other sources 


Dexamethasone 


22.1 


An anti-inflammatory 9-fluoro-glucocorticoid. 


Dexamethasone was reported to have a dramatic life-saving effect on 
people with cerebral malaria [44]. However, two subsequent placebo- 
controlled c inica trials failed to demonstrate clinical benefit [45,46]. 


Verapamil 


21.5 


A calcium channel blocker for hypertension, 
angina, and cluster headache prophylaxis. 


A 1995 study [48] reported that verapamil reverses chloroquine resistance 
in the malaria parasite. 


Quercetin 


18.9 


A flavonol found in plants, antioxidant. 


A 2012 study [47] reported that quercetin had antiplasmodial activity. 


Miconazole 


16.5 


An imidazole antifungal agent. 


Many studies [39-43] have reported antimalarial activities of antifungal 
azo e compounds. 


Clotrimazole 


15.9 


An imidazole derivative with a broad 
spectrum of antimycotic activity. 


Many studies [39-43] have reported antimalarial activities of azoles, 
including clotrimazole [43]. 


Cimetidine 


15.5 


For acid-reflux disorders (GERD), peptic ulcer 
disease, heartburn, and acid indigestion. 


A 1997 study [49] reported synergism of cimetidine with antimalaria 
agents. It is ineffective when used alone. 


Ketoconazole 


15.2 


For systemic fungal infections. 


Many studies [39-43] have reported antimalarial activities of azoles, 
including ketoconazole [41]. 


Nifedipine 


14.9 


A calcium channel blocker for angina, 
hypertension, and Raynaud's phenomenon. 




Tamoxifen 


14.4 


For breast cancer. 




Clobetasol 


14.4 


For corticosteroid-responsive dermatoses of 
the scalp. 





a The model was developed with all 1 1 antimalarial drugs in the positive class; the remaining compounds were potential repurposing candidates in the 
baseline class. 
b Ref. [26]. 



for this effect. NF-kB was later considered a potential target 
of anti-HIV chemotherapy by Pande and Ramos [38]. The 
high scoring of verapamil in our model seems to support 
NF-kB as a target for anti-HIV drugs. The reason we mis- 
identified the therapeutic effect of verapamil may again 
be partly rationalized by the lack of functional informa- 
tion (activation or inhibition) of drug-NF-icB inter- 
action of the anti-HIV drugs in the STITCH database. 

Table 4 shows the 10 drugs in the baseline class that 
scored highest by the antimalarial model. None of them 
has been approved by the FDA for the treatment of 
malaria. However, three of them are azole compounds 
(miconazole, clotrimazole, and ketoconazole), and many 
studies that have reported antimalarial activities of azole 
compounds [39-43] lend support to our model predictions. 
The highest-scored non-malarial drug is dexamethasone. It 
has been reported to have dramatic life-saving effects in 
people with cerebral malaria [44]. However, these effects 
were not observed in subsequent placebo-controlled 
clinical trials [45,46]. A flavonol nutraceutical, quercetin, 
scored third-highest, as shown in Table 3. It was reported 
in 2002 to have antiplasmodial activity [47], corroborating 
our model prediction. 

It is interesting to note that two other drugs shown in 
Table 4, verapamil and cimetidine, do not have antimal- 
arial activities themselves but exhibit synergism when 
used in combination with antimalarial drugs [48,49]. In 
the case of verapamil, it was reported that it reverses 
chloroquine resistance in the malaria parasite [48]. To 



understand why these two drugs scored high, we inves- 
tigated the most frequently occurring drug-protein in- 
teractions among the antimalarial drugs. Among the 11 
antimalarial drugs, 10 are cytochrome P-450 (CYP)2D6 
inhibitors, 4 are CYP1A2 inhibitors, 4 are CYP3A4 in- 
hibitors, 3 are CYP2C9 inhibitors, and 3 are CYP1A1 
activators. Because of their high frequencies among the 
antimalarial drugs, according to Eq. 3, the drug-protein 
interactions contributing most to the antimalarial model 
relate to the inhibition of CYP proteins. However, the hu- 
man CYP proteins are unlikely malarial drug targets. 
Instead, the similarity between the CYP-binding site, the 
heme group, and malarial drug target may be the reason 
for the apparent importance of CYP inhibition for anti- 
malarial activity. Indeed, one of the proposed antimalarial 
mechanisms of 4-amino quinolones is the formation of a 
drug-heme complex that inhibits the polymerization and 
crystallization of free heme into hemozoin [50]. The mal- 
aria parasite digests hemoglobin and produces free heme, 
which is toxic to the parasite. The formation of hemozoin 
is a parasite detoxification pathway. Quinolones with fused 
bicyclic aromatic rings bind well with heme due to tt-tt 
stacking interactions and therefore inhibit hemozoin effi- 
ciently. Because heme is also the most important structural 
moiety of the CYP enzymes, the high affinity of quinolones 
to heme also makes them inhibitors of CYP enzymes. 
Verapamil and cimetidine were scored high by the anti- 
malarial model because both of them are inhibitors of 
CYP2D6, CYP3A4, and CYP2C9. As the compounds do 
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not have a fused bicyclic aromatic moiety, they do not bind 
well with free heme and, hence, do not have antimalarial 
activity by themselves. The reason they enhance the activ- 
ity of antimalarial drugs is more likely a direct effect of 
their inhibition of Pghl, an ATP-binding cassette protein 
functioning as a membrane efflux pump [51]. It reduces 
antimalarial drugs in the parasite food vacuole and there- 
fore results in drug resistance. The human homologue of 
Pghl is P-glycoprotein, a well-known multidrug-resistant 
protein. Both verapamil and cimetidine are inhibitors of P- 
glycoprotein and, therefore, are also likely inhibitors of 
Pghl; thus, co-administration of these drugs with trad- 
itional malarial drugs should result in enhanced antimalar- 
ial activity. 

It is interesting to note that five of the ten entries in 
Tables 3 and 4 are overlapping. This was pardy a reflection 
of the five drugs having an above-average number of on- 
bits in their protein interaction profiles, effectively leading 
to more off-target effects, including a heightened potential 
for therapeutic repurposing. The numbers of on-bits in the 
drug-protein interaction profiles were 445, 111, 65, 44, and 



40 for dexamethasone, verapamil, cimetidine, ketoconazole, 
and clotrimazole, respectively, whereas the average number 
of on-bits of all 4,902 compounds was 38. 

In summary, the results shown in Tables 2, 3 and 4 
support the conclusions we derived from the results 
shown in Figures 2 and 3: that the performance of the 
drug-repurposing method developed in this project is 
robust and may have practical utility. The high-scoring 
drugs derived for each disease not only included drugs 
approved for the disease but also molecules identified in 
the literature as potential drugs against the disease. Thus, 
we can propose top-scored drugs with no recorded indica- 
tion for the disease as novel drug repurposing candidates 
for the disease. In addition, the methodology can be used 
to characterize and decipher the mode of drug action and 
identify protein targets that may enhance the effect of 
existing drugs via synergistic drug combinations. 

Application to a condition with only one approved drug 

Impaired blood clotting may lead to hemorrhagic shock 
and contribute to fatalities resulting from traumatic 



Table 5 Structures and therapeutic information of tranexamic acid and the highest-scored drugs by the tranexamic 
acid model 3 



Structure 



Generic name Score DrugBank indication 



Information from other sources 



Tranexamic Acid For preventing hemorrhage in trauma and for The only drug used to build the DPIR 

excessive bleeding during and following prediction model, 

tooth extraction, surgery, and menstruation. 



a- 



H 2 N — =" 



HUM. 



q Aminocaproic 
acid 
"OH 



! .39 For the treatment of excessive postoperative 
bleeding. 




Amiloride 1 .38 For use as adjunctive treatment with thiazide 

diuretics or other kaliuretic-diuretic agents in 
congestive heart failure or hypertension. 



Amiloride was evaluated as a treatment for 
ameliorating trauma-hemorrhagic shock- 
induced lung injury in rats [53]. 



26 experimental drugs with scores between 1.38 and 0.69 
Oh Diethylstilbestrol 0.69 



For treatment of prostate cancer and 
prevention of miscarriage or premature 
delivery in pregnant women prone to 
miscarriage or premature delivery. 



Diethylstilbestrol was found to have particular 
clinical value in the treatment of certain 
functional gynecic aberrations. One of these is 
excessive or prolonged functional uterine 
bleeding [54]. 



HO 



a The model was developed with tranexamic acid as the only member of the positive class; all other compounds were in the baseline class. 
b Ref. [26]. 
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injuries. Currently, the only drug approved by the FDA for 
reducing or preventing hemorrhage in trauma is tranex- 
amic acid. We constructed a type II model using tranex- 
amic acid as the only drug in the positive class. We used 
the model to score all other compounds, and Table 5 
shows the highest-scoring compounds. The highest- 
scored baseline compound was aminocaproic acid, a drug 
for the treatment of excessive postoperative bleeding [52] . 
As an acyclic analog of tranexamic acid, aminocaproic 
acid and tranexamic acid have a high level of structural 
similarity. Therefore, it can be expected that they have 
similar therapeutic effects. 

Two other marketed drugs that ranked high were 
amiloride, approved for congestive heart failure and 
hypertension, and diethylstilbestrol, approved for the 
treatment of prostate cancer and the prevention of mis- 
carriage or premature delivery. Amiloride was evaluated 
for trauma-hemorrhagic shock-induced lung injury in a 
rat model [53], and diethylstilbestrol was studied for 
hemostasis in functional uterine hemorrhage and con- 
cluded to be of clinical value for excessive or prolonged 
functional uterine bleeding [54]. These observations are 
in line with the model predictions. 

Limitations on comparison with other methods 

The comparison of different computational drug repur- 
posing methods is limited by a lack of publicly available 
datasets and software systems. Further complications 
arise because different methods rely on different types 
of information, ranging from experimentally/clinically 
derived data (such as drug- and disease-induced gene 
expression changes, drug side effects, and drug-protein 
interactions) to calculated measures (such as drug-drug 
structure similarity, drug target protein sequence simi- 
larity, and semantic similarity of disease phenotypes). 
Thus, because most publications do not make software, 
training data, and test data containing all types of infor- 
mation available, there is little scope for proper methodo- 
logical comparison, i.e., method A versus method B on the 
same dataset using a generally accepted standard. To fa- 
cilitate future comparison with other methods, we have 
made our drug-protein interaction dataset publicly avail- 
able for download (see Methods). 

To enable a reference comparison, we compared the 
performance of the type II models with that of a chem- 
ical fingerprint similarity search on the same datasets. 
Thus, we performed cross-validation calculations using 
one, two, and three positive drugs as the positive classes, 
and the remaining compounds, including other positive 
drugs, as the test compounds. We used the Tanimoto 
chemical similarity between a test compound and a posi- 
tive class compound, calculated by using Accelrys ECFP 4 
fingerprints, to rank order the test compounds. Table 6 
shows the calculated percentages of the positive drugs in 



Table 6 Comparison of DPIR and chemical fingerprint 
(similarity) search-based drug repurposing approaches 3 







DPIR 




Similarity 


Evaluation of Drugs in 
top-ranking scores (%) training set 


Fraction 


o 


Fraction 


o 


High Blood Pressure model 


1 


1 


0.14 


0.06 


0.09 


0.07 


5 


1 


0.36 


0.11 


0.16 


0.07 


10 


1 


0.50 


0.14 


0.24 


0.10 


1 


2 


0.14 


0.06 


0.14 


0.08 


5 


2 


0.48 


0.09 


0.22 


0.09 


10 


2 


0.64 


0.09 


0.29 


0.10 


1 


3 


0.19 


0.06 


0.18 


0.08 


5 


3 


0.53 


0.08 


0.27 


0.10 


10 


3 


0.70 


0.08 


0.33 


0.10 


HIV model 


1 


1 


0.16 


0.09 


0.08 


0.07 


5 


1 


0.28 


0.14 


0.20 


0.09 


10 


1 


0.37 


0.17 


0.25 


0.10 


1 


2 


0.17 


0.07 


0.11 


0.07 


5 


2 


0.32 


0.12 


0.28 


0.12 


10 


2 


0.44 


0.13 


0.36 


0.12 


1 


3 


0.19 


0.07 


0.13 


0.07 


5 


3 


0.34 


0.11 


0.32 


0.13 


10 


3 


0.48 


0.10 


0.44 


0.14 


Malaria model 


1 


1 


0.14 


0.07 


0.26 


0.18 


5 


1 


0.34 


0.14 


0.50 


0.21 


10 


1 


0.70 


0.22 


0.55 


0.23 


1 


2 


0.20 


0.13 


0.35 


0.16 


5 


2 


0.43 


0.20 


0.62 


0.16 


10 


2 


0.73 


0.15 


0.72 


0.16 


1 


3 


0.21 


0.09 


0.40 


0.14 


5 


3 


0.49 


0.15 


0.68 


0.14 


10 


3 


0.77 


0.13 


0.79 


0.12 



a The DPIR (type II) models were developed with one, two, or three positive 
drugs in the positive class of the training set. The chemical fingerprint search 
used Tanimoto similarity (TS) calculated with the Accelrys ECFP_4 fingerprint. 
The highest TS coefficient between a baseline compound and a positive 
compound was used to rank order the baseline compound. DPIR, drug-protein 
interaction-based repurposing; fraction, fraction of known drugs retrieved; 
a, standard deviation; bold font, highest fraction retrieved. 

the highest-ranked 1%, 5%, and 10% of the test set com- 
pounds for all three repurposing models. The chemical 
fingerprint searches provided significant enrichments, the 
success of which was largely determined by the presence 
or absence of chemically similar drugs in the approved 
drug set for each disease. For the hypertension and HIV 
drugs, the DPIR method consistently performed better 
than the structural similarity search. This was because the 
drugs used to construct the models were different from 
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each other and structural similarity was limited, whereas 
the DPIR methodology, based on capturing drug-protein 
interactions, did not require structural similarities. In con- 
trast, for malaria drugs the situation was reversed, with 
structural similarity being the most "successful" repurpos- 
ing strategy. However, this was primarily because 5 of the 
11 malaria drugs (~50%) were structurally very closely re- 
lated to quinine, the original malaria drug. They (amodia- 
quine, mefloquine, chloroquine, hydroxychloroquine, and 
primaquine) are synthetic analogs of quinine and therefore 
structurally very similar to it. Molecular structures of 
the hypertension, HIV, and malaria drugs are given in 
Additional file 1: Figures SI, Additional file 2: Figure S2, 
Additional file 3: Figure S3. They show that all malaria 
drugs can be grouped in one structural similarity cluster 
of six compounds and five singletons. The hypertension 
drugs belong to seven structural clusters and 27 single- 
tons, and the HIV drugs belong to six clusters and seven 
singletons. Thus, the hypertension and HIV drug sets rep- 
resent a more structurally diverse ensemble of drugs than 
the malaria drugs. 

In effect, DPIR could identify compounds whose activ- 
ity was due to structurally similar compounds having 
similar activity, as well as compounds containing differ- 
ent structural scaffolds but having similar protein inter- 
action profiles. This was also corroborated by the data 
shown in Table 5, where even though amiloride and di- 
ethylstilbestrol were structurally very different from 
tranexamic acid, DPIR successfully scored them high as 
potential repurposing candidates. 

Conclusions 

In this article, we described the development of a Bayesian 
statistics-based computational drug repurposing method 
termed DPIR and assessed its performance. We demon- 
strated that the method required very few known drugs to 
build a successful predictive model for test cases for which 
there are many approved drugs. We also demonstrated 
that for trauma-induced hemorrhage, for which only one 
FDA-approved drug is available, the method gave high 
scores to two drugs approved for unrelated indications, 
but with potential therapeutic effects against hemorrhage 
as supported by literature reports. These results indicate 
that DPIR is potentially applicable to diseases with as few 
as one approved drug, a challenging situation for methods 
based on a binary classifier approach. DPIR relies on large- 
scale drug-protein interaction information. In principle, 
if one knows the molecular mechanisms of a disease 
and the details of drug-protein interactions, one can 
predict whether a drug will have the desired therapeutic 
effect for a specific disease. However, details of molecu- 
lar mechanisms of drug action are not well understood 
and even unknown for many efficacious drugs, compli- 
cated by the fact that most drugs interact with a large 



number of proteins. Bayesian statistics provide a power- 
ful and unbiased approach to identify specific drug- 
protein interactions critical for a desired therapeutic 
effect. 

The DPIR method relies on the association of drug- 
protein interactions to capture and define the therapeutic 
effects of the drug based on the knowledge of at least one 
known drug. The underlying assumption in this approach 
is that the main effect of the drug is mediated through 
drug-protein interactions. These interactions create a char- 
acteristic profile, or signature, that can be used to find 
other drugs without direct knowledge of how these inter- 
actions contribute to disease treatment. Furthermore, as 
found in the case studies above, the true effect of the drug- 
protein interaction determines the therapeutic potential 
and cannot be predicted without accurate interaction an- 
notations. Thus, if the drug-protein activation or inhibition 
effect is only annotated as "binding", the model cannot re- 
solve the true nature of the interaction. Experimental veri- 
fication will, of course, always be required for probing the 
repurposing potential of any drug. 

Because of the large number of druggable targets that 
can be associated with human diseases, a major limita- 
tion of our method is availability of large-scale drug- 
protein interaction information. This is not a problem 
for drugs that have been on the market for a long time, 
as these compounds are increasingly likely to be in- 
cluded in different screening libraries and tested in mul- 
tiple bioassays over time. However, for newer drugs, 
information of their interactions with a broad spectrum 
of proteins is usually lacking and therefore may limit the 
applicability of the method. Yet with time, these com- 
pounds will be tested in more assays, and information 
about their interaction with human proteins will become 
available. As more and more screening results are depos- 
ited in the public domain, the aggregated data will become 
an invaluable resource for establishing whole genome- 
wide drug-protein interaction profiles and for discovering 
new uses for existing drugs. 
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Additional file 1: Figure SI. Hypertension drugs grouped by molecular 
structure similarity. Molecular structure similarity clusters of the hypertension 
drugs. 
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Additional file 3: Figure S3. Malaria drugs grouped by molecular 
structure similarity. Molecular structure similarity clusters of the malaria drugs. 
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